Today we are going to talk about running some basic statistics in R.

library(tidyverse)
library(plotly)
library(cowplot)

********************************************************
Note: As of version 1.0.0, cowplot does not change the
  default ggplot2 theme anymore. To recover the previous
  behavior, execute:
  theme_set(theme_cowplot())
********************************************************

Usually, when I want to run stats on a thing, I start with some basic descriptive stats. Fortunately we did this last week with the starwars dataset, so lets jump back in.

data(starwars)

Lots of statistics are applied to continuous variables. Lets take a look at the relationship between height and mass. I’m going to make an interactive plot with ggplotly so we can mouse over points and see them better.

Here I included a name and spec variable, but didn’t map them to anything. Thats so you can see them when you mouse over them.

hmPlot <- starwars %>% ggplot(aes(x = height, y = mass, name = name, spec = species)) + geom_point() + theme_cowplot()
ggplotly(hmPlot)

Woah dude. One character is anomalously heavy. Mouse over the outlier so you can see who it is.

That was my first guess too

That was my first guess too

Jabba is probably going to screw up all of our stats, lets focus on the relationship of all of the other charactersby removing him.

starwars01 <- starwars %>% filter(str_detect(name, "Jabba")) # Complecated syntac because I can't spell the rest of his name
hmPlot01 <- starwars01 %>% ggplot(aes(x = height, y = mass, name = name, spec = species)) + geom_point() + theme_cowplot()
ggplotly(hmPlot01)

So here’s our relationship. It looks sort of, but not reeally linear, which makes sense, there are lots of species in the galaxy.

Sometimes when we do stats, the stats like to imagine normally distributed data.

starwars01 %>% pivot_longer(cols = mass:height, names_to = "measurement", values_to = "value") %>% 
  ggplot(aes(x = value)) + facet_wrap(measurement~.) + geom_histogram()

Hmm. Sort of.

See if you can figure out what I did above.

Lets see if height and width are correlated.

cor.test(starwars01$height, starwars01$mass, method = "pearson")

    Pearson's product-moment correlation

data:  starwars01$height and starwars01$mass
t = 8.7853, df = 56, p-value = 4.018e-12
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.6260700 0.8520232
sample estimates:
      cor 
0.7612612 
cor.test(starwars01$height, starwars01$mass, method = "spearman")
Cannot compute exact p-value with ties

    Spearman's rank correlation rho

data:  starwars01$height and starwars01$mass
S = 6946.8, p-value = 2.597e-13
alternative hypothesis: true rho is not equal to 0
sample estimates:
     rho 
0.786313 

Here are a parametric and non-parametric correlation test. The values I usually think about are cor (pearson), which is the R value, or rho(spearman) which is its “rho” which is a lot like an R value.

Numbers closer to 1 are more positively correlated, closer to -1 are more negatively correlated, closer to 0 are not correlated.

There is a p-value for both of these, which is essentially the probability that if all of the assumptions of the model hold, the true R or rho value is zero.

Regressions

Regressions are also pretty popular with the kids these days. Lets do one! Lets see how well height explains mass

mod <- lm(mass ~ height, data = starwars01)
summary(mod)

Call:
lm(formula = mass ~ height, data = starwars01)

Residuals:
    Min      1Q  Median      3Q     Max 
-39.382  -8.212   0.211   3.846  57.327 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -32.54076   12.56053  -2.591   0.0122 *  
height        0.62136    0.07073   8.785 4.02e-12 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 19.14 on 56 degrees of freedom
  (28 observations deleted due to missingness)
Multiple R-squared:  0.5795,    Adjusted R-squared:  0.572 
F-statistic: 77.18 on 1 and 56 DF,  p-value: 4.018e-12

Lets look at what this tells us. Residuals are essentially the distance of each point from the models prediction. The one that is farthest below is ~ -39 too low, the highest above is ~ 57 and the median and first and third quartiles are also shown.

Coefficients are often what you care about. The estemate tells us about the slope and intercept of the linear regression. These values basically say our model looks like this

mass = -32.5 + 0.62 * height

We don’t know the true values of these paramters, so the standard error gives us an idea of the ranges of the two. T-value is the test score to see how good those are, band the p-value tells you if the answer is statistically significant.

4.02 x 10^-12 < 0.002 so it the height value gets a bunch of stars after it. I usually don’t make too much of the p value of the intercept.

Residual standard error is the standard deviation of the residuals, according to the internet. 56 degrees of freedom is calculated as our sample size, minus the complexity of the model. We have 58 characters, but we loose one DF for the intercept and another for the height varaible that we are using.

There is also an R^2, and adjusted R^2 which gets smaller as the model gets more complicated, and a p-value for the whole model.

Residuals

Linear models like to assume that the residuals are normally distributed. We want to make sure that no person is really affecting our model too much. We can plot the model to do this.

Processing models

the broom package gives a nice matrix version of model results

library(broom)
tidy(mod)
LS0tCnRpdGxlOiAiQmlvVkNOIFIgTGVzc29uIDA3IC0tIFN0YXRpc3RpY3MiCmF1dGhvcjogIkphY29iIEEuIENyYW0iCm91dHB1dDogaHRtbF9ub3RlYm9vawotLS0KClRvZGF5IHdlIGFyZSBnb2luZyB0byB0YWxrIGFib3V0IHJ1bm5pbmcgc29tZSBiYXNpYyBzdGF0aXN0aWNzIGluIFIuCgpgYGB7cn0KbGlicmFyeSh0aWR5dmVyc2UpCmxpYnJhcnkocGxvdGx5KQpsaWJyYXJ5KGNvd3Bsb3QpCmBgYAoKVXN1YWxseSwgd2hlbiBJIHdhbnQgdG8gcnVuIHN0YXRzIG9uIGEgdGhpbmcsIEkgc3RhcnQgd2l0aCBzb21lIGJhc2ljIGRlc2NyaXB0aXZlIHN0YXRzLiBGb3J0dW5hdGVseSB3ZSBkaWQgdGhpcyBsYXN0IHdlZWsgd2l0aCB0aGUgc3RhcndhcnMgZGF0YXNldCwgc28gbGV0cyBqdW1wIGJhY2sgaW4uCgpgYGB7cn0KZGF0YShzdGFyd2FycykKYGBgCgpMb3RzIG9mIHN0YXRpc3RpY3MgYXJlIGFwcGxpZWQgdG8gY29udGludW91cyB2YXJpYWJsZXMuIExldHMgdGFrZSBhIGxvb2sgYXQgdGhlIHJlbGF0aW9uc2hpcCBiZXR3ZWVuIGhlaWdodCBhbmQgbWFzcy4KSSdtIGdvaW5nIHRvIG1ha2UgYW4gaW50ZXJhY3RpdmUgcGxvdCB3aXRoIGBnZ3Bsb3RseWAgc28gd2UgY2FuIG1vdXNlIG92ZXIgcG9pbnRzIGFuZCBzZWUgdGhlbSBiZXR0ZXIuCgpIZXJlIEkgaW5jbHVkZWQgYSBuYW1lIGFuZCBzcGVjIHZhcmlhYmxlLCBidXQgZGlkbid0IG1hcCB0aGVtIHRvIGFueXRoaW5nLiBUaGF0cyBzbyB5b3UgY2FuIHNlZSB0aGVtIHdoZW4geW91IG1vdXNlIG92ZXIgdGhlbS4KYGBge3J9CmhtUGxvdCA8LSBzdGFyd2FycyAlPiUgZ2dwbG90KGFlcyh4ID0gaGVpZ2h0LCB5ID0gbWFzcywgbmFtZSA9IG5hbWUsIHNwZWMgPSBzcGVjaWVzKSkgKyBnZW9tX3BvaW50KCkgKyB0aGVtZV9jb3dwbG90KCkKZ2dwbG90bHkoaG1QbG90KQpgYGAKCldvYWggZHVkZS4gT25lIGNoYXJhY3RlciBpcyBhbm9tYWxvdXNseSBoZWF2eS4gCk1vdXNlIG92ZXIgdGhlIG91dGxpZXIgc28geW91IGNhbiBzZWUgd2hvIGl0IGlzLgoKCiFbVGhhdCB3YXMgbXkgZmlyc3QgZ3Vlc3MgdG9vXShwaWNzLzQ5OHB4LUphYmJhX3RoZV9IdXR0LmpwZykKCkphYmJhIGlzIHByb2JhYmx5IGdvaW5nIHRvIHNjcmV3IHVwIGFsbCBvZiBvdXIgc3RhdHMsIGxldHMgZm9jdXMgb24gdGhlIHJlbGF0aW9uc2hpcCBvZiBhbGwgb2YgdGhlIG90aGVyIGNoYXJhY3RlcnNieSByZW1vdmluZyBoaW0uCgpgYGB7cn0Kc3RhcndhcnMwMSA8LSBzdGFyd2FycyAlPiUgZmlsdGVyKCFzdHJfZGV0ZWN0KG5hbWUsICJKYWJiYSIpKSAjIENvbXBsZWNhdGVkIHN5bnRhYyBiZWNhdXNlIEkgY2FuJ3Qgc3BlbGwgdGhlIHJlc3Qgb2YgaGlzIG5hbWUKYGBgCgpgYGB7cn0KaG1QbG90MDEgPC0gc3RhcndhcnMwMSAlPiUgZ2dwbG90KGFlcyh4ID0gaGVpZ2h0LCB5ID0gbWFzcywgbmFtZSA9IG5hbWUsIHNwZWMgPSBzcGVjaWVzKSkgKyBnZW9tX3BvaW50KCkgKyB0aGVtZV9jb3dwbG90KCkKZ2dwbG90bHkoaG1QbG90MDEpCmBgYAoKU28gaGVyZSdzIG91ciByZWxhdGlvbnNoaXAuIEl0IGxvb2tzIHNvcnQgb2YsIGJ1dCBub3QgcmVlYWxseSBsaW5lYXIsIHdoaWNoIG1ha2VzIHNlbnNlLCB0aGVyZSBhcmUgbG90cyBvZiBzcGVjaWVzIGluIHRoZSBnYWxheHkuCgpTb21ldGltZXMgd2hlbiB3ZSBkbyBzdGF0cywgdGhlIHN0YXRzIGxpa2UgdG8gaW1hZ2luZSBub3JtYWxseSBkaXN0cmlidXRlZCBkYXRhLgoKYGBge3J9CnN0YXJ3YXJzMDEgJT4lIHBpdm90X2xvbmdlcihjb2xzID0gbWFzczpoZWlnaHQsIG5hbWVzX3RvID0gIm1lYXN1cmVtZW50IiwgdmFsdWVzX3RvID0gInZhbHVlIikgJT4lIAogIGdncGxvdChhZXMoeCA9IHZhbHVlKSkgKyBmYWNldF93cmFwKG1lYXN1cmVtZW50fi4pICsgZ2VvbV9oaXN0b2dyYW0oKQpgYGAKCkhtbS4gU29ydCBvZi4KClNlZSBpZiB5b3UgY2FuIGZpZ3VyZSBvdXQgd2hhdCBJIGRpZCBhYm92ZS4gCgpMZXRzIHNlZSBpZiBoZWlnaHQgYW5kIHdpZHRoIGFyZSBjb3JyZWxhdGVkLgoKYGBge3J9CmNvci50ZXN0KHN0YXJ3YXJzMDEkaGVpZ2h0LCBzdGFyd2FyczAxJG1hc3MsIG1ldGhvZCA9ICJwZWFyc29uIikKY29yLnRlc3Qoc3RhcndhcnMwMSRoZWlnaHQsIHN0YXJ3YXJzMDEkbWFzcywgbWV0aG9kID0gInNwZWFybWFuIikKYGBgCgpIZXJlIGFyZSBhIHBhcmFtZXRyaWMgYW5kIG5vbi1wYXJhbWV0cmljIGNvcnJlbGF0aW9uIHRlc3QuIFRoZSB2YWx1ZXMgSSB1c3VhbGx5IHRoaW5rIGFib3V0IGFyZSBjb3IgKHBlYXJzb24pLCB3aGljaCBpcyB0aGUgUiB2YWx1ZSwgb3IgcmhvKHNwZWFybWFuKSB3aGljaCBpcyBpdHMgInJobyIgd2hpY2ggaXMgYSBsb3QgbGlrZSBhbiBSIHZhbHVlLgoKTnVtYmVycyBjbG9zZXIgdG8gMSBhcmUgbW9yZSAgcG9zaXRpdmVseSBjb3JyZWxhdGVkLCBjbG9zZXIgdG8gLTEgYXJlIG1vcmUgIG5lZ2F0aXZlbHkgY29ycmVsYXRlZCwgY2xvc2VyIHRvIDAgYXJlIG5vdCBjb3JyZWxhdGVkLiAKClRoZXJlIGlzIGEgcC12YWx1ZSBmb3IgYm90aCBvZiB0aGVzZSwgd2hpY2ggaXMgZXNzZW50aWFsbHkgdGhlIHByb2JhYmlsaXR5IHRoYXQgaWYgYWxsIG9mIHRoZSBhc3N1bXB0aW9ucyBvZiB0aGUgbW9kZWwgaG9sZCwgdGhlIHRydWUgUiBvciByaG8gdmFsdWUgaXMgemVyby4KCiMgUmVncmVzc2lvbnMKUmVncmVzc2lvbnMgYXJlIGFsc28gcHJldHR5IHBvcHVsYXIgd2l0aCB0aGUga2lkcyB0aGVzZSBkYXlzLiBMZXRzIGRvIG9uZSEKTGV0cyBzZWUgaG93IHdlbGwgYGhlaWdodGAgZXhwbGFpbnMgYG1hc3NgCgpgYGB7cn0KbW9kIDwtIGxtKG1hc3MgfiBoZWlnaHQsIGRhdGEgPSBzdGFyd2FyczAxKQpzdW1tYXJ5KG1vZCkKYGBgCgpMZXRzIGxvb2sgYXQgd2hhdCB0aGlzIHRlbGxzIHVzLgpSZXNpZHVhbHMgYXJlIGVzc2VudGlhbGx5IHRoZSBkaXN0YW5jZSBvZiBlYWNoIHBvaW50IGZyb20gdGhlIG1vZGVscyBwcmVkaWN0aW9uLiBUaGUgb25lIHRoYXQgaXMgZmFydGhlc3QgYmVsb3cgaXMgfiAtMzkgdG9vIGxvdywgdGhlIGhpZ2hlc3QgYWJvdmUgaXMgfiA1NyBhbmQgdGhlIG1lZGlhbiBhbmQgZmlyc3QgYW5kIHRoaXJkIHF1YXJ0aWxlcyBhcmUgYWxzbyBzaG93bi4KCkNvZWZmaWNpZW50cyBhcmUgb2Z0ZW4gd2hhdCB5b3UgY2FyZSBhYm91dC4gVGhlIGVzdGVtYXRlIHRlbGxzIHVzIGFib3V0IHRoZSBzbG9wZSBhbmQgaW50ZXJjZXB0IG9mIHRoZSBsaW5lYXIgcmVncmVzc2lvbi4gVGhlc2UgdmFsdWVzIGJhc2ljYWxseSBzYXkgb3VyIG1vZGVsIGxvb2tzIGxpa2UgdGhpcwoKbWFzcyA9IC0zMi41ICsgMC42MiAqIGhlaWdodAoKV2UgZG9uJ3Qga25vdyB0aGUgdHJ1ZSB2YWx1ZXMgb2YgdGhlc2UgcGFyYW10ZXJzLCBzbyB0aGUgc3RhbmRhcmQgZXJyb3IgZ2l2ZXMgdXMgYW4gaWRlYSBvZiB0aGUgcmFuZ2VzIG9mIHRoZSB0d28uIFQtdmFsdWUgaXMgdGhlIHRlc3Qgc2NvcmUgdG8gc2VlIGhvdyBnb29kIHRob3NlIGFyZSwgYmFuZCB0aGUgcC12YWx1ZSB0ZWxscyB5b3UgaWYgdGhlIGFuc3dlciBpcyBzdGF0aXN0aWNhbGx5IHNpZ25pZmljYW50LgoKNC4wMiB4IDEwXi0xMiA8IDAuMDAyIHNvIGl0IHRoZSBoZWlnaHQgdmFsdWUgZ2V0cyBhIGJ1bmNoIG9mIHN0YXJzIGFmdGVyIGl0LiBJIHVzdWFsbHkgZG9uJ3QgbWFrZSB0b28gbXVjaCBvZiB0aGUgcCB2YWx1ZSBvZiB0aGUgaW50ZXJjZXB0LgoKUmVzaWR1YWwgc3RhbmRhcmQgZXJyb3IgaXMgdGhlIHN0YW5kYXJkIGRldmlhdGlvbiBvZiB0aGUgcmVzaWR1YWxzLCBhY2NvcmRpbmcgdG8gdGhlIGludGVybmV0Lgo1NiBkZWdyZWVzIG9mIGZyZWVkb20gaXMgY2FsY3VsYXRlZCBhcyBvdXIgc2FtcGxlIHNpemUsIG1pbnVzIHRoZSBjb21wbGV4aXR5IG9mIHRoZSBtb2RlbC4gIFdlIGhhdmUgNTggY2hhcmFjdGVycywgYnV0IHdlIGxvb3NlIG9uZSBERiBmb3IgdGhlIGludGVyY2VwdCBhbmQgYW5vdGhlciBmb3IgdGhlIGhlaWdodCB2YXJhaWJsZSB0aGF0IHdlIGFyZSB1c2luZy4KClRoZXJlIGlzIGFsc28gYW4gUl4yLCBhbmQgYWRqdXN0ZWQgUl4yIHdoaWNoIGdldHMgc21hbGxlciBhcyB0aGUgbW9kZWwgZ2V0cyBtb3JlIGNvbXBsaWNhdGVkLCBhbmQgYSBwLXZhbHVlIGZvciB0aGUgd2hvbGUgbW9kZWwuCgojIyBSZXNpZHVhbHMKTGluZWFyIG1vZGVscyBsaWtlIHRvIGFzc3VtZSB0aGF0IHRoZSByZXNpZHVhbHMgYXJlIG5vcm1hbGx5IGRpc3RyaWJ1dGVkLiBXZSB3YW50IHRvIG1ha2Ugc3VyZSB0aGF0IG5vIHBlcnNvbiBpcyByZWFsbHkgYWZmZWN0aW5nIG91ciBtb2RlbCB0b28gbXVjaC4gV2UgY2FuIHBsb3QgdGhlIG1vZGVsIHRvIGRvIHRoaXMuCgpgYGB7cn0KcGxvdChtb2QpCmBgYAoKIyMgUHJvY2Vzc2luZyBtb2RlbHMKCnRoZSBgYnJvb21gIHBhY2thZ2UgZ2l2ZXMgYSBuaWNlIG1hdHJpeCB2ZXJzaW9uIG9mIG1vZGVsIHJlc3VsdHMKCmBgYHtyfQpsaWJyYXJ5KGJyb29tKQp0aWR5KG1vZCkKYGBgCgoK